Branch removal by data shuffling

ABSTRACT

A system and method for automatically optimizing parallel execution of multiple work units in a processor by reducing a number of branch instructions. A computing system includes a first processor core with a general-purpose micro-architecture and a second processor core with a single instruction multiple data (SIMD) micro-architecture. A compiler detects branches within function calls and evaluates them with one or more records of data to determine one or more outcomes. Multiple compute sub-kernels are generated, each comprising code from the function corresponding to a unique outcome of the branch. Multiple work units are produced by assigning one or more records of data corresponding to a given outcome of the branch to one of the multiple compute sub-kernels associated with the given outcome. The branch is removed. An operating system scheduler schedules each of the one or more compute sub-kernels to the first processor core or to the second processor core.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to computing systems, and more particularly, to automatically optimizing parallel execution of multiple work units in a processor by reducing a number of control flow transfer instructions.

2. Description of the Relevant Art

The parallelization of tasks is used to increase the throughput of computer systems. To this end, compilers may extract parallelized tasks from program code to execute in parallel on the system hardware. With a single-core architecture, a single core may include deep pipelines configured to perform multi-threading. To further increase parallel execution on the hardware, a multi-core architecture may include multiple processor cores. This type of architecture may be referred to as a homogeneous multi-core architecture and may provide higher instruction throughput than a single-core architecture. However, particular instructions for a computationally intensive task may consume a disproportionate share of a shared resource, which may in turn delay deallocation of the shared resource. Examples of such specific tasks may include cryptography, video graphics rendering, and garbage collection.

To overcome the performance limitations of conventional general-purpose cores, a computer system may offload specific tasks to special-purpose hardware. This hardware may include a single instruction multiple data (SIMD) parallel architecture, a field-programmable gate array (FPGA), and/or other specialized types of processing cores. When an architecture includes multiple cores of different types, it may be referred to as a heterogeneous multi-core architecture. Depending on the scheduling of tasks, this type of architecture may provide higher instruction throughput than a homogeneous multi-core architecture.

The OpenCL® (Open Computing Language) framework supports programming across heterogeneous computing environments and includes a low-level application programming interface (API) for heterogeneous computing. The OpenCL framework (generally referred to herein as “OpenCL”) includes a C-like language. In the OpenCL framework, a function call may be referred to as an OpenCL kernel, or simply a “kernel”. A software kernel may be matched with one or more records of data to produce one or more work units of computation. Generally speaking, a SIMD architecture offers good computing performance and cost efficiency when executing such data parallel workloads. However, performance may be greatly reduced if the parallel workload includes irregular, data-dependent branch behavior. A work unit may be data independent from another work unit, but it may have data dependence within itself. A conditional test implemented as a branch instruction may pass for a first work unit, but fail for a second work unit.

During execution of an If-Then-Else construct statement, within each column of a SIMD architecture are one or more execution units configured to execute the “Then” and the “Else” paths. Current practice includes executing each of the available paths and selectively disabling the execution units corresponding to work units that did not choose the current path. The efficiency of parallel execution may be reduced as the second work unit halts execution and waits in an idle state while the first work unit continues with its ongoing execution during a given pipe stage.

SUMMARY OF EMBODIMENTS OF THE INVENTION

Systems and methods for automatically optimizing parallel execution of multiple work units in a processor by reducing a number of control flow transfer instructions are contemplated.

In one embodiment, a computing system includes a first processor core with a first micro-architecture and a second processor core with a second micro-architecture different from the first micro-architecture. In one embodiment, the first micro-architecture is a general-purpose micro-architecture and the second micro-architecture is a single instruction multiple data (SIMD) micro-architecture. The computing system includes a memory coupled to each of the first and the second processor cores. The memory stores a computer program comprising one or more compute kernels, or function calls. As a compiler traverses the instructions of a given function call, the compiler is configured to identify a control flow transfer instruction, such as a conditional branch. To evaluate the branch, the compiler may utilize one of the first and the second processor cores. One or more records of data may be used to determine one or more outcomes.

Multiple compute sub-kernels may be generated, each comprising code from the function corresponding to a unique outcome of the branch. Multiple work units are produced, each invoked in the compiled computer program by assigning one or more records of data corresponding to a given outcome of the branch to one of the multiple compute sub-kernels associated with the given outcome. The branch may be removed. In one embodiment, the assigning comprises moving said one or more records of data to a same group location in a memory for sequential or stride access. In another embodiment, the assigning comprises remapping access from originally assigned sequential records to said one or more records. A scheduler within an operating system (OS) schedules each of the one or more compute sub-kernels for execution on the first processor core or the second processor core.

These and other embodiments will be further appreciated upon reference to the following description and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a generalized block diagram of one embodiment of an exemplary processing node with a heterogeneous multi-core architecture.

FIG. 2 is a generalized block diagram of one embodiment of source code utilizing compute kernels.

FIG. 3 is a generalized block diagram of one embodiment of source code defining compute kernels with conditional statements.

FIG. 4 is a generalized block diagram of one embodiment of scheduled assignments between hardware resources and compute kernels.

FIG. 5 is a generalized block diagram of one embodiment of a logical layout of micro-architectures for two types of processor cores.

FIG. 6 is a generalized block diagram of one embodiment of a general-purpose pipeline execution flow.

FIG. 7 is a generalized block diagram of one embodiment of a SIMD pipeline execution flow.

FIG. 8 is a generalized block diagram illustrating one embodiment of code transformation by removing control flow transfer functions.

FIG. 9 is another generalized block diagram illustrating one embodiment of code transformation by removing control flow transfer functions.

FIG. 10 is a generalized block diagram illustrating one embodiment of code transformation by removing control flow transfer instructions and generating sub-kernels.

FIG. 11 is a generalized flow diagram illustrating one embodiment of a method for optimizing parallel execution of multiple work units in a processor by utilizing pre-runtime data information.

FIG. 12 is a generalized block diagram illustrating one embodiment of data shuffling in memory.

FIG. 13 is a generalized block diagram illustrating one embodiment of creating an index array for data access.

FIG. 14 is a generalized flow diagram illustrating one embodiment of a method for creating an index array for data access.

FIG. 15 is a generalized block diagram illustrating one embodiment of an algorithm for generating index arrays.

FIG. 16 is a generalized block diagram illustrating one embodiment of source code defining the compute kernels utilizing the index arrays.

FIG. 17 is a generalized block diagram illustrating one embodiment of index array generation for two branches.

While the invention is susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, one having ordinary skill in the art should recognize that the invention might be practiced without these specific details. In some instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the present invention.

Referring to FIG. 1, one embodiment of an exemplary processing node 110 with a heterogeneous multi-core architecture is shown. Processing node 110 may include one or more processing units 115, which may include one or more processor cores 112 and an associated cache memory subsystem 114. In one embodiment, processor core 112 utilizes a general-purpose micro-architecture.

Processing node 110 may also include one or more processing units 170, which may comprise one or more processor cores 172 and data storage buffers 174. Processor core 172 may not be a mirrored silicon image of processor core 112. Processor core 172 may have a micro-architecture different from the micro-architecture used by processor core 112. In one embodiment, the processor core 172 may be a different generation of a same processor family as processor core 112. In another embodiment, the processor core 172 may be a voltage and/or frequency scaled version of processor core 112. In other words, the processor core 172 is not a silicon copy of the processor core 112 with a same functionality and instruction set architecture (ISA), a same clock frequency, same cache sizes, a same memory model, and so forth.

Continuing with the micro-architecture of processor core 172, in yet another embodiment, the processor core 172 may comprise a micro-architecture that provides high instruction throughput for a computationally intensive task. Processor core 172 may have a parallel architecture. For example, the processor core 172 may be a single instruction multiple data (SIMD) core. Examples of SIMD cores include graphics processing units (GPUs), digital signal processing (DSP) cores, and others. In one embodiment, the processing node 110 comprises a single instruction set architecture (ISA). Typically, as is well known in the art, single-ISA multi-core architectures have been shown to provide higher power and throughput performance for chip multiprocessors (CMP).

High instruction throughput on processing node 110 may be achieved with measured power consumption within a given power limit when threads of software applications are efficiently scheduled. The threads may be scheduled on one of processor cores 112 and 172 in a manner such that each thread has the highest instruction throughput based at least in part on the runtime hardware resources of the processor cores 112 and 172.

Continuing with the components in the processing node 110, the processing node 110 may include memory controller 120 and interface logic 140. In one embodiment, the illustrated functionality of processing node 110 is incorporated upon a single integrated circuit. In one embodiment, processor cores 112 include circuitry for executing instructions according to a predefined general-purpose instruction set. For example, the SPARC® instruction set architecture (ISA) may be selected. Alternatively, the x86, x86-64®, Alpha®, PowerPC®, MIPS®, PA-RISC®, or any other instruction set architecture may be selected. Generally, processor cores 112 access the cache memory subsystems 114, respectively, for data and instructions. If the requested block is not found in cache memory subsystem 114 or in shared cache memory subsystem 118, then a read request may be generated and transmitted to the memory controller within the node to which the missing block is mapped.

In one embodiment, processing unit 170 is a graphics processing unit (GPU). Modern GPUs are very efficient at manipulating and displaying computer graphics. The highly parallel structure of GPUs makes them more effective than general-purpose central processing units (CPUs), such as processing unit 115, for a range of complex algorithms. Typically, a GPU executes calculations used for graphics and video and a CPU executes calculations for many more system processes than graphics alone. Conventional GPUs utilize very wide single instruction multiple data (SIMD) architectures to achieve high throughput in image-rendering applications. Such applications generally entail executing the same programs, such as vertex shaders or pixel shaders, on large numbers of objects (vertices or pixels). Since each object is processed independently of other objects, but the same sequence of operations is used, a SIMD architecture provides considerable performance enhancement. GPUs have also been considered for non-graphical calculations.

In one embodiment, the GPU 170 may be located on a video card. In another embodiment, the GPU 170 may be integrated on the motherboard. In yet another embodiment, the illustrated functionality of processing node 110 may be incorporated upon a single integrated circuit. In such an embodiment, the CPU 115 and the GPU 170 may be proprietary cores from different design centers. Also, the GPU 170 may now be able to directly access both local memories 114 and 118 and main memory via memory controller 120 from the processing node 110, rather than perform memory accesses off-chip via interface 140. This embodiment may lower latency for memory accesses for the GPU 170, which may translate into higher performance.

Continuing with the components of processing node 110 in FIG. 1, cache subsystems 114 and 118 may comprise high-speed cache memories configured to store blocks of data. Cache memory subsystems 114 may be integrated within respective processor cores 112. Alternatively, cache memory subsystems 114 may be coupled to processor cores 112 in a backside cache configuration or an inline configuration, as desired. Still further, cache memory subsystems 114 may be implemented as a hierarchy of caches. Caches that are located nearer processor cores 112 (within the hierarchy) may be integrated into processor cores 112, if desired. In one embodiment, cache memory subsystems 114 each represent L2 cache structures, and shared cache subsystem 118 represents an L3 cache structure. Both the cache memory subsystem 114 and the shared cache memory subsystem 118 may include a cache memory coupled to a corresponding cache controller.

Generally, packet processing logic 116 is configured to respond to control packets received on the links to which processing node 110 is coupled, to generate control packets in response to processor cores 112 and/or cache memory subsystems 114, to generate probe commands and response packets in response to transactions selected by memory controller 120 for service, and to route packets for which node 110 is an intermediate node to other nodes through interface logic 140. Interface logic 140 may include logic to receive packets and synchronize the packets to an internal clock used by packet processing logic 116.

Turning now to FIG. 2, one embodiment of source code utilizing compute kernels is shown. OpenCL™ (Open Computing Language) is one example of a low-level application programming interface (API) for heterogeneous computing. OpenCL includes a C-like language that defines execution queues, wherein each queue is associated with an OpenCL device. An OpenCL device may be a CPU, a GPU, or other unit with at least one processor core within the heterogeneous multi-core architecture. A function call may be referred to as an OpenCL kernel, or simply a “compute kernel”. The OpenCL framework may improve computing performance for a wide variety of data-parallel applications used in gaming, entertainment, science and medical fields. For a heterogeneous architecture, a computer program typically comprises a collection of compute kernels and internal functions. A software programmer may define the compute kernels, whereas the internal functions may be defined in a given library.

For a data-parallel software application, an N-Dimensional computation domain may define an organization of an “execution domain”. The N-Dimensional computation domain may also be referred to as an N-Dimensional grid or an N-Dimensional Range (“NDRange”). The NDRange may be a one-, two-, or three-dimensional space. This dimensional space may also be referred to as an index space. For example, a software application may perform data processing on a two-dimensional (2D) array of data, such as an image file. The software application may perform an algorithm developed by a software programmer on a pixel-by-pixel basis of a 2D image. A given compute kernel may be invoked over the index space (the NDRange).

Typically after compilation, the arguments and parameters of each compute kernel are set. Additionally, associated memory objects and buffers are created. A given instance of the compute kernel may be executed as its own software thread. However, a compute kernel may include control flow transfer instructions that create forks, whereas a fork in a computer program, by common definition, typically creates a software thread. A given instance of the compute kernel at a given point in the index space may be referred to as a work unit or work item. A work unit may operate with the one or more instructions in the compute kernel on a record of data corresponding to a given pixel (a given index) of the 2D image. Typically, work units have an associated unique identifier (ID). In another example, an introductory computer program processing the string “Hello World” may have one work unit for computing each letter in the string.

The NDRange may define a total number of work units that execute in parallel if there is sufficient hardware support. For example, the NDRange may define a total of 280 work units, but a GPU may support the simultaneous execution of 64 work units at any given time. The total number of work units may define a global work size. As is well known to those skilled in the art, the work units may be further grouped into work groups. Each work group may have a unique identifier (ID). The work units within a given work group may be able to communicate with each other, synchronize execution, and coordinate memory accesses. A number of work units may be clustered into a wave front for simultaneous execution on a GPU in a SIMD manner. Regarding the example above for 280 total work units, a wave front may include 64 work units.
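
As a small illustration of the arithmetic above, the following C snippet computes how many wave fronts cover the example NDRange. It is an illustrative sketch only; the values 280 and 64 come from the example in this description.

#include <stdio.h>

int main(void)
{
    /* values taken from the example above */
    const unsigned global_work_size = 280; /* total work units in the NDRange */
    const unsigned wavefront_width  = 64;  /* work units executed together in SIMD fashion */

    /* ceiling division: the last wave front may be only partially filled */
    unsigned wavefronts = (global_work_size + wavefront_width - 1) / wavefront_width;

    printf("%u wave fronts, last one holds %u work units\n",
           wavefronts, global_work_size - (wavefronts - 1) * wavefront_width);
    return 0;
}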

The OpenCL framework is an open programming standard for various compute devices, or OpenCL devices. A software programmer may avoid writing vendor-specific code, which may result in improved code portability. Other frameworks are available and may offer more vendor-specific coding for heterogeneous architectures. For example, NVIDIA offers Compute Unified Device Architecture (CUDA®) and AMD offers ATI Stream®. With a CUDA framework, a compute kernel is typically statically compiled when the computer program is compiled. With an OpenCL framework, a compute kernel is typically compiled with a Just-In-Time (JIT) method. The JIT method may generate an appropriate binary code after obtaining the system configuration. With a JIT compilation method, the compilation time is included in the total execution time. Therefore, compiler optimizations may increase the execution time. In addition, at run time the OpenCL compiler generates multiple versions of compute kernels. One version of a compute kernel may be generated for each OpenCL device type, such as a general-purpose CPU, a SIMD GPU, and so forth.

The two frameworks, OpenCL and CUDA, have a difference in terminology between their respective execution models. For example, a work unit, a work group, a wave front and an NDRange in OpenCL have corresponding terms in CUDA such as a thread, a thread block, a warp and a grid. Throughout the rest of the description, the terms corresponding to OpenCL are used. However, the systems and methods described may apply to CUDA, ATI Stream and other frameworks.

As shown in FIG. 2, code 210 defines two function calls entitled “doWorkA” and “doWorkB”. Each function call may be referred to as a “compute kernel”. A compute kernel may be matched with one or more records of data to produce one or more computational work units. Therefore, two or more work units may utilize the same instructions of the single function call, but operate on different records of data. For example, the function call “Power2” in code 220 may be used to execute 10 work units, one for each data value in the array “INPUT”. Here, a record comprises a single data value. In other examples, a record may comprise two or more fields, wherein each field includes a data value. A SIMD micro-architecture may efficiently execute the instructions of the kernel “Power2”, calculate the power of 2 for the values in the INPUT array and write the output to the RESULT array.

The OpenCL framework may invoke an instance of a compute kernel multiple times in parallel. Each call to the compute kernel has one associated unique ID (a work unit ID) that may be fetched by calling an internal function named get_global_id(0). Regarding the above example in code 220, the compute kernel “Power2” is invoked once for each data value in the INPUT array. In this case, the compute kernel “Power2” is invoked 10 times. Accordingly, ten unique work unit IDs are fetched. With a JIT compiling method, these instances are invoked at runtime. The OpenCL framework may differentiate between these different instances by utilizing the unique work unit IDs. The data to be operated on (a record) may also be specified, such as a specific data value in the INPUT array. Therefore, at runtime, a work unit may be scheduled by default to the same OpenCL device to which the associated compute kernel is scheduled.
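
For purposes of illustration, a minimal OpenCL C sketch of a kernel in the spirit of “Power2” in code 220 is shown below. The actual code 220 appears only in FIG. 2, so the argument names and the assumption that the kernel squares each input value are hypothetical.

__kernel void Power2(__global const int *INPUT, __global int *RESULT)
{
    size_t id = get_global_id(0);        /* the unique work unit ID */
    RESULT[id] = INPUT[id] * INPUT[id];  /* assumed: one record squared per work unit */
}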

Turning now to FIG. 3, one embodiment of source code defining compute kernels with conditional statements is shown. Similar to code 210, the code 230 shown in FIG. 3 defines two function calls entitled “doWorkA” and “doWorkB”. Again, each function call may be referred to as a “compute kernel”. Here, only one of the two compute kernels may be executed during runtime. The selection of which compute kernel is executed is based on a conditional test provided by the function call “EvaluateFunction”. A result of a given instruction, or whether the given instruction is executed at all, is data-dependent on the execution of previous instructions and on the data of an associated record. If the result of the conditional test is not consistent among a wave front of work units, the benefits of a SIMD micro-architecture may be reduced. For example, a given SIMD core may have 64 parallel computation units available for simultaneous execution of 64 work units. However, if half of the 64 work units pass the conditional test while the other half fails the conditional test, then only half of the parallel computation units are utilized during a given stage of processing.
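
The divergence problem may be sketched with a kernel in the spirit of code 230. This is not the code of FIG. 3; the modulo test standing in for “EvaluateFunction” and the placeholder arithmetic are assumptions for illustration only.

__kernel void BranchyKernel(__global const int *records, __global int *out)
{
    size_t id = get_global_id(0);
    if ((records[id] % 2) == 0)       /* stands in for EvaluateFunction(...) */
        out[id] = records[id] + 1;    /* placeholder for the doWorkA path */
    else
        out[id] = records[id] - 1;    /* placeholder for the doWorkB path */
}

On a 64-wide SIMD core, the work units of a wave front that take different paths through this kernel leave computation units idle during the path they did not choose.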

Turning now to FIG. 4, a generalized block diagram illustrating one embodiment of scheduled assignments 400 between hardware resources and compute kernels is shown. Here, the partitioning of hardware and software resources and their interrelationships and assignments during the execution of one or more software applications 430 is shown. In one embodiment, an operating system 420 allocates regions of memory for compute kernels 440 a-440 j and 440 k-440 q. When applications 430, or computer programs, execute, each application may comprise multiple compute kernels. For example, a first executing application may comprise compute kernels 440 a-440 j and a second executing application may comprise compute kernels 440 k-440 q. Each one of the kernels 440 a-440 q may be used to generate one or more work units by being combined with one or more records of data (not shown). For example, compute kernel 440 a may produce work units 442 a-442 d, compute kernel 440 j may produce work units 442 e-442 h, compute kernel 440 k may produce work units 442 j-442 m and compute kernel 440 q may produce work units 442 n-442 q. A work unit may execute independently of other work units and execute concurrently with other work units.

Each of the compute kernels shown in FIG. 4 may own its own resources such as an image of memory, or an instance of instructions and data before application execution. Each of the compute kernels may also comprise process-specific information such as address space that addresses the code, data, and possibly a heap and a stack; variables in data and control registers such as stack pointers, general and floating-point registers, program counter, and otherwise; and operating system descriptors such as stdin, stdout, and otherwise, and security attributes such as a set of permissions.

In one embodiment, hardware computing system 410 incorporates a general-purpose processor core 112 and a SIMD processor core 172, each configured to process one or more work units. In another embodiment, system 410 includes two other heterogeneous processor cores. In general, for a given application, operating system 420 sets up an address space for the application, loads the application's code into memory, sets up a stack for the program, branches to a given location inside the application, and begins execution of the application. Typically, the portion of the operating system 420 that manages such activities is the operating system (OS) compute kernel 422. The OS compute kernel 422 is referred to as “OS compute kernel” in order not to confuse it with a compute kernel, or a function call. The OS compute kernel 422 may further determine a course of action when insufficient memory is available for the execution of the application. As stated before, an application may be divided into more than one compute kernel and system 410 may be running more than one application. Therefore, there may be several compute kernels running in parallel. The OS compute kernel 422 may decide at any time which of the simultaneously executing compute kernels is allocated to the processor cores 112 and 172. The OS compute kernel 422 may allow a process to run on a core of a processor, which may have one or more cores, for a given amount of time referred to as a time slice. An OS scheduler 424 in the operating system 420 may comprise decision logic for assigning compute kernels to cores.

In one embodiment, only one compute kernel can execute at any time on any one of the hardware computation units 412 a-412 g and 412 h-412 r. These hardware computation units comprise hardware that can handle the execution of a given instruction of a given work unit with associated data. This hardware may include an arithmetic logic unit that is configured to perform addition, multiplication, zero detect, a bit-wise shift, division, video graphics and multimedia instructions, or other operations known to those skilled in the art of processor design. These hardware computation units may include a hardware thread in a multi-threaded processor, a parallel hardware column in a SIMD micro-architecture, and so forth.

The dashed lines in FIG. 4 denote assignments and do not necessarily denote direct physical connections. Thus, for example, hardware computation unit 412 a may be assigned to execute work unit 442 d. However, later (e.g., after a context switch), the hardware computation unit 412 a may be assigned to execute work unit 442 h. In one embodiment, the OS scheduler 424 may schedule the work units 442 a-442 q to the hardware computation units 412 a-412 r with a round-robin scheme. Alternatively, the OS scheduler 424 may schedule the work units 442 a-442 q to the cores 112 and 172 with a round-robin scheme. An assignment of a given work unit to a given hardware computation unit may be performed by an associated processor core. In another embodiment, the OS scheduler 424 may perform the scheduling based on availability of the processor cores 112 and 172. In yet another embodiment, the OS scheduler 424 may perform the scheduling according to assignments created by a programmer utilizing the OpenCL™ API or another similar API. These scheduling schemes may restrict portability and performance when there is a mismatch between the work unit assignments and hardware resources.

Referring to FIG. 5, a generalized block diagram illustrating one embodiment of a logical layout of micro-architectures for two types of processor cores is shown. Although each of a general-purpose core 510 and a single instruction multiple data (SIMD) core 560 is shown, other types of heterogeneous cores are possible and contemplated. Each of the cores 510 and 560 has a dynamic random access memory (DRAM) 550 a and 550 b for storage of data and instructions. In one embodiment, the cores 510 and 560 share a same DRAM. In another embodiment, a given level of a cache memory subsystem (not shown) is shared in addition to the DRAM. For example, referring again to FIG. 1, the cache memory subsystem 118 is shared by the cores 112 and 172.

Each of the cores 510 and 560 includes a cache memory subsystem 530. As shown, the general-purpose core 510 logically has the cache memory subsystem 530 separate from the control logic 520 and the arithmetic logic units (ALUs) 540. The data flow within the core 510 may be pipelined, although storage elements, such as pipeline registers, are not shown in order to simplify the illustration. In a given pipeline stage, an ALU may be unused if instructions in this stage do not utilize a certain type of ALU or if another work unit (or another thread for a general-purpose core) consumes the ALUs during this stage.

As shown, the SIMD core 560 has the cache memory subsystem 530 grouped with control logic 520 for each row of computation units 542. The data flow within the core 560 may be pipelined, although storage elements, such as pipeline registers, are not shown in order to simplify the illustration. In a given pipeline stage, a computation unit may be unused if an associated instruction in this stage is not executed based on a previous failed test, such as a not-taken branch.

Referring now to FIG. 6, a generalized block diagram illustrating one embodiment of a general-purpose pipeline execution flow 600 is shown. Instructions 602-608 may be fetched and enter a general-purpose pipeline. Instruction 606 may be a computation intensive instruction. During particular stages of the pipeline execution flow, one or more of the instructions 602-608 consume resources in the general-purpose processor core 112, such as decoder logic, instruction scheduler entries, reorder buffer entries, ALUs, register file entries, branch prediction units, and so forth.

In a balanced scheme, each of the instructions 602-608 consumes an equal amount of resources in each stage. However, typically, a general-purpose core does not replicate resources for each instruction due to real-estate cost, power consumption and other design considerations. Therefore, the workload may become unbalanced. For example, the instruction 606 may consume more resources for one or more pipe stages due to its computation intensive behavior. As shown, the resources 630 consumed by this instruction may become far greater than the resources consumed by other instructions. In fact, the computation intensive instruction may block the usage of hardware resources by other instructions.

Some computation intensive tasks may place pressure on shared resources within the general-purpose core 112. Thus, throughput losses occur for both the computationally intensive process and other processes waiting for the shared resources. In addition, some instructions occupy the shared resource and other resources on the die to support the computation being performed on the shared resource. Such a long latency instruction may concurrently block other processes from using several resources during a long latency.

Referring now to FIG. 7, a generalized block diagram illustrating one embodiment of a SIMD pipeline execution flow 700 is shown. Instructions 702-708 may be fetched and enter a SIMD pipeline with associated data. Instruction 704 may be a control flow transfer instruction, such as a branch. The instruction 706 may be a first instruction in a taken path. For example, the branch instruction 704 may be associated with an IF statement in a high-level language program. The instruction 706 may be associated with a THEN statement in the high-level language program. The instruction 708 may be a first instruction in a not-taken path. The instruction 708 may be associated with an ELSE statement in the high-level language program.

Each of the computation units within a given row may be a same computation unit. Each of these computation units may operate on a same instruction, but different data associated with a different work unit. As shown, some of the work units pass the test provided by the branch instruction 704 and other work units fail the test. The SIMD core 172 may execute each of the available paths and selectively disable the execution units, such as the computation units, corresponding to work units that did not choose the current path. For example, during execution of an If-Then-Else construct statement, within each column of a SIMD architecture are execution units configured to execute the “Then” (Path A) and the “Else” (Path B) paths. The efficiency of parallel execution may be reduced as the first and the second work units halt execution and wait as the third work unit continues with its ongoing execution. Therefore, not all of the computation units are active computation units 710 in a given row after execution of the branch instruction 704. If a large number of computation units are inactive during a given pipe stage, the efficiency and throughput of the SIMD core is reduced.

Referring now to FIG. 8, a generalized block diagram illustrating one embodiment of code transformation by removing control flow transfer functions is shown. Similar to code 210 and code 230 shown in FIG. 2 and FIG. 3, the code 232 defines two function calls entitled “doWorkA” and “doWorkB”. Again, each function call may be referred to as a “compute kernel”. In the example shown, the code 230 has been transformed into the code 232, wherein the conditional IF statement with a function “EvaluateFunction” has been removed.

Only one of the two compute kernels in code 232 is executed during runtime for a given group of records. For example, the conditional IF statement may evaluate to “true” if a given record has an even data value. Similarly, the conditional IF statement may evaluate to “false” if a given record has an odd data value. For a given array with data values 1 to 10, the even records {2, 4, 6, 8, 10} may evaluate to true and the function doWorkA is executed. The odd records {1, 3, 5, 7, 9} may evaluate to false and the function doWorkB is executed. After the code transformation from code 230 to code 232, scheduling may include combining the function “KernelFunctionA” with the even records {2, 4, 6, 8, 10} to generate five work units. Similarly, scheduling may include combining the function “KernelFunctionB” with the odd records {1, 3, 5, 7, 9} to generate five additional work units. In one embodiment, the combining of the compute kernel code with a record of data may be performed by shuffling, or rearranging, the records in memory into groups. In another embodiment, this combination may be performed by creating an index array that maps sequential or stride indices to scattered actual locations in memory. Further details of both embodiments are provided below.
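
A hedged OpenCL C sketch of the transformed code 232 follows. The kernel names come from the description above, while the argument names and the placeholder helper bodies are assumptions, since the actual code 232 appears only in FIG. 8.

/* placeholder helper bodies standing in for the original doWorkA and doWorkB paths */
int doWorkA(int v) { return v + 1; }
int doWorkB(int v) { return v - 1; }

/* branch-free sub-kernel paired only with the even-valued records, e.g. {2, 4, 6, 8, 10} */
__kernel void KernelFunctionA(__global const int *records, __global int *out)
{
    size_t id = get_global_id(0);
    out[id] = doWorkA(records[id]);
}

/* branch-free sub-kernel paired only with the odd-valued records, e.g. {1, 3, 5, 7, 9} */
__kernel void KernelFunctionB(__global const int *records, __global int *out)
{
    size_t id = get_global_id(0);
    out[id] = doWorkB(records[id]);
}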

Referring now to FIG. 9, another generalized block diagram illustrating one embodiment of code transformation by removing control flow transfer instructions is shown. The transformation shown is a generalization of the transformation between code 230 and code 232. Application code 910 comprises at least function call definitions 920, functions 930-950, an IF statement 960 with a function, a THEN Path 970 with a function and an ELSE Path 980 with a function. As shown, the function call 930 includes variable initialization code 932, straight-line code 934, and an IF statement 936 with a function, a THEN Path 938 with a function and an ELSE Path 940 with a function. The components shown for application code 910 are for illustrative purposes. Other components may be included or arranged in a different order.

After transformation, the application code 912 may include the components used in the application code 910, but without the conditional IF statements 960 and 936. In addition, each of the THEN Paths 970 and 938 and the ELSE Paths 980 and 940 may be altered to include a surrounding function call that inputs the proper records. A similar transformation is shown in code 232. Again, the combining of the compute kernel code with a record of data may be performed by shuffling, or rearranging, the records in memory into groups. Alternatively, this combination may be performed by creating an index array that maps sequential or stride indices to scattered actual locations in memory.

Turning now to FIG. 10, a generalized block diagram illustrating one embodiment of code transformation by removing control flow transfer instructions and generating sub-kernels is shown. Program code 1010 has two IF statements. The transformations for the four possible outcomes of the IF statements and the resulting function code are shown to the right. The transformations remove conditional statements and are performed in a manner as shown in the above descriptions. For example, function code 1012 shows a result of a transformation of program code 1010 for a given record of data that causes both branches to fail. Function code 1014 shows a result of a transformation of program code 1010 for a given record of data that causes the first branch to fail and the second branch to pass. Function code 1016 and 1018 show results of transformations for records of data that cause the remaining two possible results.

Turning now to FIG. 11, one embodiment of a method 1100 for optimizing parallel execution of multiple work units in a processor by utilizing pre-runtime data information is shown. The components embodied in the processing node 110 and the hardware resource assignments shown in FIG. 4 described above may generally operate in accordance with method 1100. For purposes of discussion, the steps in this embodiment and subsequent embodiments of methods described later are shown in sequential order. However, some steps may occur in a different order than shown, some steps may be performed concurrently, some steps may be combined with other steps, and some steps may be absent in another embodiment.

In block 1102, a software program or subroutine may be located and analyzed. This software program may be written for compilation and execution on a heterogeneous multi-core architecture. Program code may refer to any portion of a software application, subroutine, dynamic linked library, or otherwise. A pathname may be entered at a command prompt by a user, or a pathname may be read from a given directory location, or elsewhere, in order to begin compiling the source code. The program code may be written by a designer in a high-level language such as C, a C-like language such as OpenCL™, and so forth. In one embodiment, the source code is statically compiled. In such an embodiment, during a static front-end compilation, the source code may be translated to an intermediate representation (IR). A back-end compilation step may translate the IR to machine code. The static back-end compilation may perform various transformations and optimizations. In another embodiment, the source code is compiled with a Just-In-Time (JIT) method. The JIT method may generate an appropriate binary code after obtaining the system configuration. With either method, the compiler may identify a compute kernel in the program code.

In block 1104, the compiler may read one or more instructions of the compute kernel and analyze them. A conditional statement may be a control flow transfer instruction, such as a branch. Different types of control flow transfer instructions may include forward/backward branches, direct/indirect branches, jumps, and so forth. It may be possible for a compiler or other tool to statically determine a direction of a branch and/or a target of a branch. However, in one embodiment, some processing typically performed during runtime on associated data may be performed during compilation. For example, a simple test to determine a direction (taken, not-taken) of a branch may be performed. Although compilation may be referred to as “static compilation”, one or more dynamic operations may be performed. This compilation may also be referred to as “pre-runtime compilation”. Another example of a dynamic step performed at this time is identifying a next instruction to execute in each of the THEN, ELSE IF and ELSE blocks of an If-Then-ElseIf-Else construct.

If a conditional statement is not identified (conditional block 1106), then in block 1110, any remaining analysis and instrumentation is completed and work units are scheduled for runtime execution. If a conditional statement is identified (conditional block 1106), and data is available for pre-runtime evaluation (conditional block 1112), then in block 1114, the access of data for runtime execution of compute kernels is altered based on the branch results. For example, the combining of the compute kernel code with a record of data may be performed by shuffling, or rearranging, the records in memory into groups. Alternatively, this combination may be performed by creating an index array that maps sequential or stride indices to scattered actual locations in memory.

Turning now to FIG. 12, a generalized block diagram illustrating one embodiment of data shuffling in memory is shown. Memory 1210 may be main memory such as DRAM. In addition, the contents stored in memory 1210 may be stored in one or more levels of a cache memory subsystem. The application data 1220 may store records of data for a given software application, wherein each record may include one or more fields comprising data values. As shown, application data 1220 may include records 1222-1230.

The original records order 1240 may not yield optimal parallel execution of multiple work units generated from compute kernels. Therefore, as described in block 1112 of method 1100, the code for compute kernels may be analyzed. Given instructions may be evaluated with associated records of data. Based on the results, the records may be rearranged in memory to provide optimal parallel execution of the generated work units. In one embodiment, within a given group of records, each associated work unit may return a same result for one or more conditional instructions like a branch.

The records 1222 and 1226 may provide the same results for two branches in the example shown. Here, each of records 1222 and 1226 may fail each of the two branches, as do other records in the arrangement 1260. Therefore, each of records 1222 and 1226 may be moved to data group 1250. Similarly, each of records 1224 and 1228 may fail a first branch and pass a second branch, as do other records in the arrangement 1280. Therefore, each of records 1224 and 1228 may be moved to data group 1270.

The work units associated with data group 1250 may be scheduled together for execution. The work units associated with data group 1270 may be scheduled together for execution after the work units associated with data group 1250. In a simple example, originally record 1222 may have been associated with a work unit ID 0, record 1224 may have been associated with a work unit ID 1, record 1226 may have been associated with a work unit ID 2, and so forth. However, after the data shuffling, the record 1222 may still be associated with a work unit ID 0, but record 1226 may be associated with a work unit ID 1. A record (not shown) following record 1226 in data group 1250 may be associated with a work unit ID 2, and so forth. The record 1224 may be associated with a work unit ID following a work unit ID for a last record in data group 1250. Referring again to FIG. 8, the code in the compute kernel “KernelFunctionA” may be executed on the records in data group 1250. The code in the compute kernel “KernelFunctionB” may be executed on the records in data group 1270.
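
The record shuffling described above may be sketched in host-side C as a stable partition for a single branch. The one-field record type and the even-value test standing in for the evaluated branch are illustrative assumptions.

#include <stddef.h>

typedef struct { int value; } Record;   /* hypothetical single-field record */

int branch_passes(const Record *r)      /* stands in for the evaluated branch */
{
    return (r->value % 2) == 0;
}

/* Stable partition: records that fail the branch are packed first (one data group),
   records that pass follow (a second data group). Returns the size of the first group. */
size_t shuffle_records(const Record *in, Record *out, size_t n)
{
    size_t fail_count = 0;
    for (size_t i = 0; i < n; i++)
        if (!branch_passes(&in[i]))
            fail_count++;

    size_t fail_pos = 0, pass_pos = fail_count;
    for (size_t i = 0; i < n; i++) {
        if (!branch_passes(&in[i]))
            out[fail_pos++] = in[i];
        else
            out[pass_pos++] = in[i];
    }
    return fail_count;                  /* work unit IDs are then reassigned in order */
}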

As shuffling of data in memory may include copying large amounts of data, an alternative method includes creating an index array and accessing the data via the index array. Referring now to FIGS. 13 and 14, one embodiment of creating an index array for data access is shown. The method 1400 of FIG. 14 will be explained together with the example shown in FIG. 13. In this example, eight records of data have record identifiers (IDs) 1302 of 0 to 7 for ease of illustration. These record IDs and corresponding data may be stored in a records array.

In block 1402, an index N may be reset to 0 and the code of a given compute kernel may be analyzed. The index N may maintain a count of branch instructions. In block 1404, a direction of a detected branch instruction may be evaluated. For a given conditional statement in a compute kernel, the branch results 1304 are as shown, wherein a binary “1” indicates “Taken/Pass” and a binary “0” indicates “Not-Taken/Fail”. In other examples, the indications of the binary values may be switched. A conditional statement in the compute kernel code may include an “EvaluateFunction” as shown in code 230 in FIG. 3. The conditional “EvaluateFunction” may be invoked on each record to generate a bitmap indicating the associated direction of the branch for each record. The “EvaluateFunction” may return a binary index that may be used to determine a direction (“Taken”, “Not-Taken”) of a given branch depending on the data in a given record. The branch results 1304 may be stored in a mask array. In block 1406, a number of partitions may be determined based on the count N. The number of partitions may indicate a number of new compute kernels to generate. Referring again to FIG. 10, for N branches in a compute kernel, 2^N functions are generated.
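
The per-record evaluation in block 1404 that fills the mask array may be sketched in C as follows. The predicate standing in for “EvaluateFunction” is an assumption.

#include <stddef.h>

int EvaluateFunction(int value)          /* placeholder predicate */
{
    return (value % 2) == 0;
}

/* maskArray[i] = 1 if record i takes the branch, 0 if it does not */
void build_mask_array(const int *records, int *maskArray, size_t n)
{
    for (size_t i = 0; i < n; i++)
        maskArray[i] = EvaluateFunction(records[i]) ? 1 : 0;
}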

In one embodiment, in order to efficiently determine a mapping 1320 between work unit IDs 1310 and a created index array 1312 used for actual parallel execution of the work units, a prefix sum technique may be used for parallel processing. Generally speaking, a prefix sum is an operation on lists in which each element in the result list is obtained from the sum of the elements in an operand list up to its index. For example, an input array with n values, {x₀, . . . , x_(n-1)}, may be used to produce an output array {y₀, y₁, y₂, . . . , y_(n-1)}, where y₀ = x₀, y₁ = y₀ + x₁, . . . , y_(i) = y_(i-1) + x_(i). This prefix sum 1306 may be stored in a sum array. The prefix sum technique may utilize one or more instructions already supported by a processor. In block 1408, the prefix sum 1306 may be generated from the branch results 1304.
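
A sequential C sketch of the inclusive prefix sum in block 1408 is given below; the parallel, hardware-assisted versions mentioned above compute the same result.

#include <stddef.h>

/* sumArray[i] = maskArray[0] + maskArray[1] + ... + maskArray[i] */
void prefix_sum(const int *maskArray, int *sumArray, size_t n)
{
    int running = 0;
    for (size_t i = 0; i < n; i++) {
        running += maskArray[i];   /* 1 for a taken branch, 0 for not-taken */
        sumArray[i] = running;
    }
}

Consistent with the example of FIG. 13, in which records 0, 1, 3 and 6 take the branch, the mask 1, 1, 0, 1, 0, 0, 1, 0 yields the sums 1, 2, 2, 3, 3, 3, 4, 4.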

In block 1410, an index array may be determined for each partition found in block 1406. In one embodiment, index arrays may be generated by utilizing the algorithm 1510 as shown in FIG. 15. The algorithm 1510 is further described below. Both index array0 1312 and index array1 1314 may be generated using the algorithm 1510 and the prefix sum 1306. A mapping 1320 may be generated using the work unit IDs 1310 and the index arrays 1312 and 1314. Here, the work unit IDs 1310 have the same values as the record IDs 1302 for ease of illustration. The record IDs 1302 may be originally assigned to the work unit IDs 1310, wherein this assignment is based on sequential locations in memory and sequential increments of an ID pointer. For example, the original records order 1240 may be used. If a last marked branch is reached (conditional block 1414), then in block 1416, the final index arrays and generated functions may be used for scheduling the work units and execution. For example, work unit ID 2 may be associated with a compute kernel with function code corresponding to a Taken Path and the record ID 3. The work unit ID 5 may be associated with a compute kernel with function code corresponding to a Not-Taken Path and the record ID 4. The functions generated from code within a compute kernel due to the branch removal may be referred to as compute sub-kernels.

Referring to FIG. 15, one embodiment of an algorithm 1510 for generating index arrays is shown. In one embodiment, the steps in algorithm 1510 may be used to generate index array0 1312 and index array1 1314 in the example shown in FIG. 13. The branch results 1304 may be stored in the maskArray shown in the algorithm. The prefix sum 1306 may be stored in the sumArray shown in the algorithm.

For each taken branch indicated as a binary “1” in the branch results 1304, the “Then” path of the algorithm 1510 may be executed. Here, an index is set to one less than the prefix sum value associated with the record that caused the taken direction. An index array associated with a given partition determined in block 1406 of method 1400 is updated at that index with an ID. In one embodiment, the ID is the ID of the record that produced the taken direction of the given branch.

For each not-taken branch indicated as a binary “0” in the branch results 1304, the “Else” path of the algorithm 1510 may be executed. Here, an index is set to the record ID value less the prefix sum value associated with the record that caused the not-taken direction. An index array associated with a different partition than the partition described above for the “Then” path, also determined in block 1406 of method 1400, is updated with the ID of the record that caused the not-taken direction. Once each of the records has been traversed, an associated index array has been constructed for each partition.
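
A sequential C sketch of the index-array construction described by the “Then” and “Else” paths of algorithm 1510 is given below. The figure itself is not reproduced here, so any names beyond maskArray and sumArray are assumptions.

#include <stddef.h>

/* indexArray0 collects record IDs whose branch was taken,
   indexArray1 collects record IDs whose branch was not taken */
void build_index_arrays(const int *maskArray, const int *sumArray, size_t n,
                        int *indexArray0, int *indexArray1)
{
    for (size_t i = 0; i < n; i++) {
        if (maskArray[i]) {
            /* "Then" path: one less than the prefix sum value for this record */
            indexArray0[sumArray[i] - 1] = (int)i;
        } else {
            /* "Else" path: the record ID less the prefix sum value for this record */
            indexArray1[i - (size_t)sumArray[i]] = (int)i;
        }
    }
}

For the eight-record example of FIG. 13, in which records 0, 1, 3 and 6 take the branch, this construction produces index array0 = {0, 1, 3, 6} and index array1 = {2, 4, 5, 7}, matching the mapping 1320 described below.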

Referring to FIG. 16, one embodiment of source code 1610 defining the compute kernels utilizing the index arrays is shown. In one embodiment, particular branches are marked for evaluation during the index array generation. There may be a large number of branches in a given compute kernel. Rather than perform analysis for each branch, a software programmer, a compiler or another tool may determine which branches are marked for analysis and index array generation. Similar to code 210 and code 230 shown in FIG. 2 and FIG. 3, respectively, the code 1610 includes two function calls entitled “doWorkA” and “doWorkB”. Again, each function call may be referred to as a “compute kernel”. Here, only one of the two compute kernels may be executed during runtime. The IDs of a record array may be sequentially traversed. However, an index array may be accessed by a given record ID and a mapped value for the record ID is provided. The mapped value may be used to access a given record for execution of an associated function of the two functions “doWorkA” and “doWorkB”. For example, referring again to FIG. 13, the record IDs 0-3 are mapped by the index array0 1312 to IDs 0, 1, 3 and 6. The associated records are combined with the code used for a path for a taken branch. The record IDs 4-7 are mapped by the index array1 1314 to IDs 2, 4, 5 and 7. The associated records are combined with the code used for a path for a not-taken branch. In the code 1610, these paths may be defined by the code in the functions “doWorkA” and “doWorkB”.
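
The indirect record access performed by code 1610 may be sketched in OpenCL C as shown below. The argument names and the placeholder arithmetic standing in for the real “doWorkA” body are assumptions.

/* work unit IDs are sequential, but each is remapped through the index array
   to a scattered record location in memory */
__kernel void doWorkA_indexed(__global const int *indexArray0,
                              __global const int *records,
                              __global int *results)
{
    size_t i = get_global_id(0);     /* sequential work unit ID, e.g. 0 to 3 */
    int rec = indexArray0[i];        /* mapped record ID, e.g. 0, 1, 3, 6 */
    results[rec] = records[rec] * 2; /* placeholder for the doWorkA body */
}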

Turning now to FIG. 17, one embodiment of index array generation for two branches is shown. Generation 1710 illustrates the steps previously discussed in FIG. 13 where only one branch was discussed. Generation 1720 expands on the generation 1710 when a second branch instruction is detected in a compute kernel. With two branches, four index arrays are generated. The index array0 corresponds to both branches being taken and includes record ID 0. The index array1 corresponds to the first branch being taken and the second branch being not-taken. The index array1 includes record IDs 1, 3 and 6. The index array2 corresponds to the first branch being not-taken and the second branch being taken. The index array2 includes record IDs 2, 4 and 5. Finally, the index array3 corresponds to both branches being not-taken. The index array3 includes record ID 7.
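
A simple sequential C sketch of the four-way bucketing in generation 1720 follows; a prefix-sum-based parallel construction is equally possible but omitted here. The partition ordering matches the description above.

#include <stddef.h>

/* mask1[i] and mask2[i] are 1 if record i takes the first and second branch, respectively.
   Partition 0: both taken; 1: taken/not-taken; 2: not-taken/taken; 3: both not-taken. */
void build_two_branch_index_arrays(const int *mask1, const int *mask2, size_t n,
                                   int *indexArrays[4], size_t counts[4])
{
    for (int p = 0; p < 4; p++)
        counts[p] = 0;

    for (size_t i = 0; i < n; i++) {
        int p = (mask1[i] ? 0 : 2) + (mask2[i] ? 0 : 1);
        indexArrays[p][counts[p]++] = (int)i;
    }
}

With the branch outcomes shown in FIG. 17, this reproduces index array0 = {0}, index array1 = {1, 3, 6}, index array2 = {2, 4, 5} and index array3 = {7}.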

With index array generation and subsequent remapping of the access of records of data during execution, the computation units within a work group are enabled without reshuffling data in memory. Some processors may contain a prefix sum instruction that can be used to accelerate the generation process. In such an embodiment, because the records are not moved, the data is not reshuffled back into an original order once the computation is complete. However, in some embodiments the generated index arrays may be used to reshuffle the data in memory and, after execution of the compute kernels and compute sub-kernels, the index arrays may be used to return the data to original locations. The reshuffled data may be more coalesced, or compact, in memory. Coalesced data typically provides better performance on GPUs that may have no, or limited, caching mechanisms. Accordingly, the benefit of increased performance during execution may outweigh the cost of reshuffling records of data in memory. The generated index arrays may also be used to rearrange the record data into a different memory layout, such as changing a row-oriented arrangement into a column-oriented arrangement, or vice-versa.

In one embodiment, the compiler may analyze the control flow test decisions of a compute kernel and produce compute sub-kernels as shown above in FIG. 10 to handle more general control flow graphs. The GPU hardware may be enhanced to produce a logical bitmask with Boolean results of the control flow decisions. A local slice (workgroup-sized) is accessible to the compute kernel. At the control flow decision point, a decision bitmask may be processed to produce a set of indices with which this set of instances of the compute kernel continues processing. For example, the GPU registers corresponding to the local_Id and global_Id designators may be updated. Essentially, the kernel assumes a new “identity” at this point. If this compute kernel instance contains live data (in registers) that was created dependent on the compute kernel ID, then the compiler may generate code to store this data in memory to be consumed by the proper instance of the compute kernel which assumes this identity. Alternatively, the compiler may elect to terminate the compute kernel and generate a new compute kernel that is invoked at this point, completing the execution.

In another embodiment, an architecture with a low-cost compute kernel dispatch and memory sharing between a CPU and a GPU may have the CPU execute the control flow graph, and have a corresponding compute kernel for each basic block of the control flow graph. The CPU may be in charge of dispatching the proper compute kernels, which do not have control flow, at each decision point in the control flow graph.

It is noted that the above-described embodiments may comprise software. In such an embodiment, the program instructions that implement the methods and/or mechanisms may be conveyed or stored on a computer readable medium. Numerous types of media which are configured to store program instructions are available and include hard disks, floppy disks, CD-ROM, DVD, flash memory, Programmable ROMs (PROM), random access memory (RAM), and various other forms of volatile or non-volatile storage. Generally speaking, a computer accessible storage medium may include any storage media accessible by a computer during use to provide instructions and/or data to the computer. For example, a computer accessible storage medium may include storage media such as magnetic or optical media, e.g., disk (fixed or removable), tape, CD-ROM, or DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, or Blu-Ray. Storage media may further include volatile or non-volatile memory media such as RAM (e.g. synchronous dynamic RAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.) SDRAM, low-power DDR (LPDDR2, etc.) SDRAM, Rambus DRAM (RDRAM), static RAM (SRAM), etc.), ROM, Flash memory, non-volatile memory (e.g. Flash memory) accessible via a peripheral interface such as the Universal Serial Bus (USB) interface, etc. Storage media may include microelectromechanical systems (MEMS), as well as storage media accessible via a communication medium such as a network and/or a wireless link.

Additionally, program instructions may comprise behavioral-level descriptions or register-transfer level (RTL) descriptions of the hardware functionality in a high level programming language such as C, or a design language (HDL) such as Verilog, VHDL, or a database format such as GDS II stream format (GDSII). In some cases the description may be read by a synthesis tool, which may synthesize the description to produce a netlist comprising a list of gates from a synthesis library. The netlist comprises a set of gates which also represent the functionality of the hardware comprising the system. The netlist may then be placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks may then be used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits corresponding to the system. Alternatively, the instructions on the computer accessible storage medium may be the netlist (with or without the synthesis library) or the data set, as desired. Additionally, the instructions may be utilized for purposes of emulation by a hardware based type emulator from such vendors as Cadence®, EVE®, and Mentor Graphics®.

Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

1. A computer implemented method comprising: identifying a branch instruction in a compute kernel within a computer program; generating a plurality of compute sub-kernels, each corresponding to a unique outcome of the branch and comprising code from the compute kernel; and producing a plurality of work units by assigning one or more records of data corresponding to a given outcome of the branch to one of the plurality of compute sub-kernels corresponding to the given outcome.

2. The method as recited in claim 1, further comprising removing the branch from a compiled version of the computer program.

3. The method as recited in claim 1, further comprising scheduling the sub-kernels for execution on at least one of a first processor core or a second processor core.

4. The method as recited in claim 1, wherein assigning said one or more records of data comprises moving said one or more records of data to a same location in a memory for sequential or stride based access.

5. The method as recited in claim 1, wherein assigning said one or more records of data comprises remapping access from originally assigned sequential records to said one or more records.

6. The method as recited in claim 5, wherein remapping for each of the plurality of compute sub-kernels is done in a parallel manner.

7. The method as recited in claim 1, further comprising utilizing prefix sums based on branch outcomes to remap access from originally assigned sequential records to said one or more records.

8. The method as recited in claim 7, wherein the second processor core is a graphics processing unit configured to compute the prefix sum.

9. A computing system including a multi-core architecture comprising: a processor; and a memory storing program instructions; wherein the program instructions are executable by the processor to: analyze a computer program; identify a branch instruction in a compute kernel within a computer program; evaluate the branch with a given record of data to determine an outcome; generate a plurality of compute sub-kernels, each comprising code from the compute kernel corresponding to a unique outcome of the branch; and produce a plurality of work units to be invoked in the compiled computer program by assigning one or more records of data corresponding to a given outcome of the branch to one of the plurality of compute sub-kernels associated with the given outcome.

10. The computing system as recited in claim 9, wherein the program instructions are further executable to remove the branch from a compiled version of the computer program.

11. The computing system as recited in claim 9, further comprising scheduling the work units for execution on the processor.

12. The computing system as recited in claim 9, wherein assigning said one or more records of data to said one of the plurality of compute sub-kernels comprises moving said one or more records of data to a same group location in a memory for sequential or stride access.

13. The computing system as recited in claim 9, wherein assigning said one or more records of data to said one of the plurality of compute sub-kernels comprises remapping access from originally assigned sequential records to said one or more records.

14. The computing system as recited in claim 13, wherein remapping for each of the plurality of compute sub-kernels is done in a parallel manner.

15. The computing system as recited in claim 14, wherein the parallel remapping utilizes a prefix sum technique based on branch outcomes.

16. The computing system as recited in claim 9, wherein the second processor core is configured to compute and utilize prefix sums based on branch outcomes to remap access from originally assigned sequential records to said one or more records.

17. A computer readable storage medium storing program instructions, wherein the program instructions are executable to: identify a branch instruction in a compute kernel within a computer program; generate a plurality of compute sub-kernels, each corresponding to a unique outcome of the branch and comprising code from the compute kernel; and produce a plurality of work units by assigning one or more records of data corresponding to a given outcome of the branch to one of the plurality of compute sub-kernels corresponding to the given outcome.

18. The computer readable storage medium as recited in claim 17, wherein the program instructions are further executable to remove the branch from a compiled version of the computer program.

19. The computer readable storage medium as recited in claim 17, wherein assigning said one or more records of data to said one of the plurality of compute sub-kernels comprises moving said one or more records of data to a same group location in a memory for sequential or stride access.

20. The computer readable storage medium as recited in claim 17, wherein the program instructions are further executable to compute and utilize prefix sums based on branch outcomes to remap access from originally assigned sequential records to said one or more records.